Using Cox Proportional Hazards Model
April 22, 2025
What is it?
A statistical regression method for modeling time-to-event outcomes in survival data (Abeysekera and Sooriyarachchi 2009)
Can handle censored data
Primarily used in the health field, but also applied to predicting bank failure, the survival probability of machines, and the likelihood of insurance payouts
The model assumes that as time goes on, the survival probability will approach zero with no survivors (Asghar, Khalil, and Uddin 2024)
The proportional hazards assumption can limit the ability to correctly predict the effect of a variable (Jiang, Wu, and Li 2024)
Covariate selection can become biased and may not accurately represent the true data (Wang, Chang, and Lin 2025; Zhang, Cheng, and Carrillo-Larco 2025)
The model cannot provide a specific time at which the event will happen, only the probability that the event has happened by a given time
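Concordance Index: \(C = \frac{c + \frac{t_x}{2}}{c + d + t_x}\), where \(c\) is the number of concordant pairs, \(d\) the number of discordant pairs, and \(t_x\) the number of pairs tied on the predictor
CPH Model Hazard Function: \(h(t|\mathbf Z) = h_0(t)\exp(\sum\limits_{k=1}^{p} \beta_kZ_k)\), where \(h_0(t)\) is the baseline hazard, \(\beta_k\) are the coefficients, and \(Z_k\) are the covariates
Proportional Hazards Ratio: \(\frac{h(t|\mathbf Z)}{h(t|\mathbf Z^*)} = \exp[\sum\limits_{k=1}^{p} \beta_k(Z_k - Z_k^*)]\)
Cumulative Hazard Function: \(H(t) = \int_0^t h(u) du\)
Survival Probability: \(S(t) = e^{-H(t)}\), where \(H(t)\) is the cumulative hazard function above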
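The hazard ratio and survival formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code behind these slides: the coefficients, covariate vectors, and baseline cumulative hazard are made-up values, and the function names are hypothetical. It assumes time-fixed covariates, so that \(H(t|\mathbf Z) = H_0(t)\exp(\sum_k \beta_k Z_k)\).

```python
import math

def hazard_ratio(beta, z, z_star):
    """exp(sum_k beta_k * (z_k - z*_k)); the baseline hazard h_0(t) cancels."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, z, z_star)))

def survival_probability(cumulative_baseline_hazard, beta, z):
    """S(t | Z) = exp(-H_0(t) * exp(sum_k beta_k * z_k))."""
    relative_risk = math.exp(sum(b * a for b, a in zip(beta, z)))
    return math.exp(-cumulative_baseline_hazard * relative_risk)

# Hypothetical values, for illustration only.
beta = [0.5, -0.2]      # assumed coefficients
z = [1.0, 3.0]          # covariates of one unit
z_star = [0.0, 3.0]     # covariates of a reference unit

hr = hazard_ratio(beta, z, z_star)        # depends only on the covariate difference
s = survival_probability(0.1, beta, z)    # 0.1 is an assumed value of H_0(t)
print(hr, s)
```

Because the two units differ only in the first covariate, the ratio reduces to \(\exp(\beta_1)\), which is the usual reading of a single coefficient in this model.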
There are four assumptions for CPH:
Independence Assumption
Non-informative Censoring Assumption
Linearity Assumption
Proportional Hazards Assumption
Model accuracy is evaluated using the concordance index
The concordance index measures how well the ordering of the model's predicted risks agrees with the ordering of the observed event times
A value of 1 means all pairs are correctly ordered, while a value of 0 means no pairs are correctly ordered
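The pair counting behind \(C = \frac{c + \frac{t_x}{2}}{c + d + t_x}\) can be sketched as follows. This is a simplified illustration rather than the routine used to produce the reported values: the function name and the sample data are hypothetical, a pair is treated as comparable only when the unit with the shorter time actually failed, and pairs tied on time are skipped.

```python
def concordance_index(times, events, risks):
    """Concordance from pair counts: (c + t_x / 2) / (c + d + t_x)."""
    concordant = discordant = tied_risk = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: i is observed to fail strictly before j's time.
            if events[i] == 1 and times[i] < times[j]:
                if risks[i] > risks[j]:
                    concordant += 1      # earlier failure had the higher risk
                elif risks[i] < risks[j]:
                    discordant += 1      # earlier failure had the lower risk
                else:
                    tied_risk += 1       # tied on the predictor
    total = concordant + discordant + tied_risk
    if total == 0:
        raise ValueError("no comparable pairs")
    return (concordant + tied_risk / 2) / total

# Hypothetical data: the last unit is censored (event = 0).
times = [149, 269, 206, 235]
events = [1, 1, 1, 0]
risks = [3.0, 0.5, 2.0, 2.0]

c_index = concordance_index(times, events, risks)
print(c_index)
```

In this toy data there are four concordant pairs, no discordant pairs, and one pair tied on risk, so the index is 4.5 / 5 = 0.9.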
Survival probability can be predicted at a specific time \(t\)
If the probability is \(\geq 50\)%, it is assumed the event has not occurred
If the probability is \(< 50\)%, it is assumed the event has occurred
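The 50% rule above amounts to a simple threshold on the predicted survival probability. A minimal sketch, with a hypothetical function name and made-up probabilities, using the same 0/1 coding as the status column (0 = not failed, 1 = failed):

```python
def predict_status(survival_probability, threshold=0.5):
    """Return 0 (event has not occurred) if S(t) >= threshold, else 1 (event has occurred)."""
    return 0 if survival_probability >= threshold else 1

# Hypothetical predicted survival probabilities at some time t.
probabilities = [0.86, 0.17, 0.47, 0.53]
predicted = [predict_status(p) for p in probabilities]
print(predicted)
```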
The data selected for this project come from a NASA study on propagation modeling, specifically the engine two testing and training datasets (Saxena et al. 2008).
Each engine in the NASA data has an unknown amount of initial wear, manufacturing variation, and sensor noise
There are three operational setting fields (os) and twenty-one sensor measurement fields (sm)
A status column was added to both the testing and training datasets, with 0 indicating the machine has not failed and 1 indicating the machine has failed
| id | time | status | os1 | os2 | os3 | sm1 | sm2 | sm3 |
|---|---|---|---|---|---|---|---|---|
| 1 | 149 | 1 | 42.0017 | 0.8414 | 100 | 445 | 550.49 | 1366.01 |
| 2 | 269 | 1 | 42.0047 | 0.8411 | 100 | 445 | 550.11 | 1368.75 |
| 3 | 206 | 1 | 42.0073 | 0.8400 | 100 | 445 | 550.80 | 1356.97 |
519 engines in the combined data
260 engines in training data
259 engines in testing data
| Metric | Value |
|---|---|
| Minimum | 128.00 |
| Median | 199.00 |
| Mean | 206.77 |
| Standard Deviation | 46.78 |
| Maximum | 378.00 |
The table provides the model number, the covariates used, the AIC and BIC from the stepwise regression if applicable, and the concordance index.
| Model | Covariates | AIC | BIC | Concordance |
|---|---|---|---|---|
| model 1 | All covariates | 2449.92 | 2547.71 | 0.6956658 |
| model 2 | os3, sm3 - sm5, sm8 - sm9, sm15 - sm18, sm20 | 2431.88 | N/A | 0.6917216 |
| model 3 | os3, sm3 - sm4, sm8-sm9, sm15, sm18 | N/A | 2461.31 | 0.6833401 |
| model 4 | os3, sm13 - sm14, sm19 | N/A | N/A | 0.5881404 |
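For reference, the two selection criteria in the table are usually defined as below. This is a sketch of the standard forms, assuming the stepwise procedure used them unchanged: \(\hat{L}\) is the maximized (partial) likelihood, \(k\) the number of estimated coefficients, and \(n\) the sample size used by the software (some implementations use the number of events for Cox models).

\[
\text{AIC} = 2k - 2\ln\hat{L}, \qquad \text{BIC} = k\ln(n) - 2\ln\hat{L}
\]

Lower values are preferred for both. Since \(\ln(n) > 2\) once \(n \geq 8\), BIC penalizes each added covariate more heavily than AIC, which is consistent with the BIC-selected model 3 keeping fewer covariates than the AIC-selected model 2.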
The model for the continuing analysis uses all covariates except sm16 and sm19.
The linearity assumption is tested by plotting the Martingale residuals (Martingale Residuals = Observed Events - Expected Events) against each covariate and checking whether the trend has a slope of zero
Based on the plots below, this assumption is met.
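The residual definition above (observed events minus expected events) can be sketched directly. This is an illustration under stated assumptions, not the code behind the plots: the expected events per unit (the model's estimated cumulative hazard at each unit's observed time) are made-up numbers, the function names are hypothetical, and a straight-line least-squares slope stands in for the smoothed curve normally drawn on such plots.

```python
def martingale_residuals(status, expected_events):
    """Observed events (0 or 1) minus expected events for each unit."""
    return [d - e for d, e in zip(status, expected_events)]

def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx

# Hypothetical inputs, for illustration only.
status = [1, 1, 0, 1]                 # 1 = failed, 0 = not failed
expected = [0.8, 1.3, 0.4, 0.9]       # assumed cumulative hazard at each observed time
covariate = [1.0, 2.0, 3.0, 4.0]      # values of one covariate

residuals = martingale_residuals(status, expected)
slope = least_squares_slope(covariate, residuals)
print(residuals, slope)
```

A slope near zero across the range of a covariate is what the linearity check looks for; a clear trend or curve would suggest transforming that covariate.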
The proportional hazards assumption is tested by looking at the scaled Schoenfeld partial residuals (the value of the covariate minus the expected value of the covariate at the time of failure) and examining the p-value
Based on the table and plots below, this assumption is met; the global test has a p-value of 0.0625, which is above the 0.05 level
| variable | chisq | df | p |
|---|---|---|---|
| os1 | 4.293034 | 1 | 0.0382688 |
| os2 | 2.188373 | 1 | 0.1390561 |
| sm1 | 3.083761 | 1 | 0.0790775 |
| sm2 | 1.092906 | 1 | 0.2958282 |
| GLOBAL | 32.955156 | 22 | 0.0625074 |
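The per-covariate p-values in the table are upper-tail probabilities of a chi-square statistic with one degree of freedom, which can be checked with the standard library alone. This sketch uses the identity \(P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})\); the function name is hypothetical, and the GLOBAL row (22 degrees of freedom) is outside what this one-degree-of-freedom shortcut covers.

```python
import math

def chisq_p_value_df1(chisq):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chisq / 2.0))

# Chi-square statistics copied from the table above.
statistics = {
    "os1": 4.293034,
    "os2": 2.188373,
    "sm1": 3.083761,
    "sm2": 1.092906,
}

p_values = {name: chisq_p_value_df1(stat) for name, stat in statistics.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.7f}")
```

The computed values agree with the p column of the table to about five decimal places.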
| id | time | status | HazardRate | SurvivalProbability |
|---|---|---|---|---|
| 1 | 149 | 1 | 7.657873 | 0.8617456 |
| 2 | 269 | 1 | 2.585519 | 0.1719981 |
| 3 | 206 | 1 | 3.085786 | 0.4688056 |
| 4 | 235 | 1 | 1.488771 | 0.5305390 |
| 5 | 154 | 1 | 1.492792 | 0.9575895 |
| Percent | Time |
|---|---|
| 100% Survival | 128 |
| 75% Survival | 201 |
| 50% Survival | 245 |
| 25% Survival | 299 |
| 10% Survival | 347 |